Project: Investigating European Soccer Database

Table of Contents

Introduction

This database comes from Kaggle. It contains data for soccer matches, players, and teams from several European countries from 2008 to 2016.

In this investigation we are going to try to answer these questions:

1- What teams improved the most over the time period?
2- Which players had the most penalties?
3- What team attributes lead to the most victories?

Data Wrangling

I have explored the database using db browser,

It contains 7 tables: Country, League, Match, Player, Player_Attributes, Team, Team_Attributes.

Let's load all tables from the database and save them in dataframes.

Exploring all dataframes contents

Rename some columns to prevent confusion

Joining Team, Country, League & Match dataframes into one df

Joining Player & Player_Attributes dataframes into one dataframe and selecting needed columns

# Joining Team & Team_Attributes dataframes into one dataframe and selecting needed columns

Data Cleaning

Now we joinrd all tables into three dataframes: all_teams, all_players, all_matches

Let's clean them

1- Handling duplicates

Check for duplicates in all_players dataframe

Check for duplicates in all_teams

So, no duplicates in all_matches or all_teams dataframes

2- Handling NAN values

Check for nulls in all_players dataframe

The null values in penalties means the player has never played a penalty and so his penalty attribute should be zero

Replacing nulls in penalties columns with zeros, meaning that the player has not played any penalty

Check for nulls in all_teams dataframe

The buildUpPlayDribbling column is almost all zeros Let's drop it

Exploratory Data Analysis

1- What teams improved the most over the time period?

Selecting all numeric attributes

Summing all numerical attributes into a new column (overall_quality)

Group the teams by date

Get a dataframe with minimum (oldest) dates of each team and another dataframe with the newest dates

Get improvement dataframe by subtracting the newest and oldest

Plot the Top 10 most improved teams

So the first question is clearly answered as the Top 10 most improved teams are:

['Swansea City', 'Queens Park Rangers', 'Birmingham City', 'Kilmarnock', 'Cardiff City', 'Manchester City', 'Empoli', 'Livorno', 'Hibernian', 'Motherwell']

2- Which players had the most penalties?

As shown from the table, the Top player in penalties is: Rickie Lambert

3- What team attributes lead to the most victories?

Plot heatmap to show the relation between all atributes and number of goals

Plot scatter plots between number of goals and each attribute

As we see from the charts, the most important attributes for winning are: ['chanceCreationShooting', 'chanceCreationCrossing', 'buildUpPlaySpeed']

Conclusions

Some limitations: